Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/ENH: Constructors for DatetimeArray/TimedeltaArray #23493

Closed
wants to merge 20 commits into from

Conversation

jbrockmendel
Copy link
Member

Big push on the constructors for DatetimeArray/TimedeltaArray, some progress de-duplicating code from their Index counterparts.

As discussed elsewhere, adds dtype to PeriodArray.__init__

@TomAugspurger can you confirm that the _from_sequences implemented here handle the right cases? It wasn't obvious if object-dtyped array/indexes were supposed to be handled there, but combining those cases was too clean to overlook.

Small fix in DatetimeArray comparison methods, just enough to make the tests work. A separate PR forthcoming PR will do a more thorough fix/improvements of those.

One non-obvious point on which input would be especially welcome is how/when to use the copy kwarg in such a way as to copy at-most-once. (related: deep_copy_if_needed and #21907)

#23491 found during this process, will be addressed in a follow-up.

@pep8speaks
Copy link

pep8speaks commented Nov 4, 2018

Hello @jbrockmendel! Thanks for updating the PR.

Comment last updated on November 04, 2018 at 18:58 Hours UTC

@@ -199,10 +198,11 @@ def _join_i8_wrapper(joinf, **kwargs):

_engine_type = libindex.DatetimeEngine

tz = None
_tz = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tz property is defined below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed? (just curious to understand)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fairly certain that the version currently here is a mistake, since it is overriden by the property definition of tz below. _freq and _tz are set in _simple_new, so I think the idea of also defining them on the class is to make introspecting what attributes exist easy.

_freq = None
_comparables = ['name', 'freqstr', 'tz']
_attributes = ['name', 'freq', 'tz']
timetuple = None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another we'll-need-this-later (when moving from inheritance to composition)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, what is this? Is it intended to be public? It's not present in the public API for 0.23.4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stdlib datetime.datetime comparison methods check if this attribute exists and if so return NotImplemented, otherwise raise TypeError for non-datetime objects

if self._has_same_tz(value):
return _to_m8(value)
raise ValueError('Passed item and index have different timezone')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moving this out of the Constructors section for readability

assert expected.freq == dti.freq
assert expected.tz == dti.tz

# broken until ABCDatetimeArray and isna is fixed
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be done in the forthcoming PR mentioned in the OP.

pandas/core/arrays/datetimes.py Show resolved Hide resolved

# NB: Among other things not yet ported from the DatetimeIndex
# constructor, this does not call _deepcopy_if_needed
return result

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
# list, tuple, or object-dtype ndarray/Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to turn into an object array here? to_datetime handles all of these cases

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right we could make do without it. I like doing this explicitly because to_datetime is already overloaded and circular.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is horribly inefficient and unnecessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't do it here, to_datetime is going to do this. It may be unnecessary, but it is not horribly inefficient. What is a code smell is the circularity involved in calling to_datetime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then just call array_to_datetime and don’t force the conversion to array

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the root problem (referenced in your "circularity" comment, and down below in TimedeltaIndex.__new__) that to_datetime / to_timedelta returns an Index instead of an EA?

Could we have the public to_datetime just be a simple

array = _to_datetime(...)
return DatetimeIndex(array)

so the internal _to_datetime returns the array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the root problem (referenced in your "circularity" comment, and down below in TimedeltaIndex.new) that to_datetime / to_timedelta returns an Index instead of an EA?

It's not the fact that it's an Index so much as that it is a circular dependency. I think I can resolve this in an upcoming commit.

Looking through to_datetime and _convert_listlike_datetimes, I don't see a conversion to ndarray[object].

_convert_listlike_datetimes calls ensure_object.

Sorry, to_datetime has in intermediate datetime64[ns] -> object -> datetime64[ns] conversion? That seems unnecessary.

Not sure what you're referring to. As implemented _from_sequence is specifically for list, tuple, or object-dtype NDArray/Index. datetime64-dtype goes through a different path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_convert_listlike_datetimes calls ensure_object.

That's after an

    # these are shortcutable
    if is_datetime64tz_dtype(arg):
        if not isinstance(arg, DatetimeIndex):
            return DatetimeIndex(arg, tz=tz, name=name)
        if tz == 'utc':
            arg = arg.tz_convert(None).tz_localize(tz)
        return arg

    elif is_datetime64_ns_dtype(arg):
        if box and not isinstance(arg, DatetimeIndex):
            try:
                return DatetimeIndex(arg, tz=tz, name=name)
            except ValueError:
                pass

        return arg

So those both avoid conversion to object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtensionArray._from_sequence is for any sequence of scalar objects, including a ndarray with a specialized type like datetime64[ns]. It'll be used, for example, in factorize(ExtensionArray).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger thank you for clarifying; I was under the mistaken impression that it was specifically list/tuple/object-dtype.

Are there any restrictions on kwargs that can be added to it? In particular I'm thinking of freq and tz

@classmethod
def _from_sequence(cls, scalars, dtype=_TD_DTYPE, copy=False):
# list, tuple, or object-dtype ndarray/Index
values = np.array(scalars, dtype=np.object_, copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't call to_timedelta, so this does require that we pass an object array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then it should
let’s not reinvent the wheel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it shouldn't. to_timedelta will just end up calling array_to_timedelta64 like this does, but only after doing a bunch of unecessary dtype checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides this is what TimedeltaIndex.__new__ currently calls

------
ValueError
"""
if dtype != _TD_DTYPE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AssertionError no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When called from _simple_new this is internal so AssertionError would make sense, but it is also called from __new__ so is in principle user-facing.

Either way I need to add tests for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well this should never happen all conversions should be before this

so it should assert

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype is part of the signature of TimedeltaArray.__new__, which is/will be user-facing. If the user passes the wrong dtype, its a ValueError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no my point is there should be and i think currently there is already conversion

if it’s wrong at this point it’s not a user error but an incorrect path taken

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that this check function is called two times, one of which is the very first thing in TimedeltaArray.__new__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the discussion above, is it worth having a 15 line function (including docstrings :-)), for a 2-liner used in two places?
I would maybe simply leave it in place how it was, I think reading something like assert dtype == _TD_DTYPE in TimedeltaArray._simple_new is clearer than calling into a helper function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable. But hey, its a nice docstring.

pandas/core/indexes/datetimes.py Outdated Show resolved Hide resolved
@jreback jreback added Datetime Datetime data dtype Timedelta Timedelta data type labels Nov 4, 2018
@jbrockmendel
Copy link
Member Author

Good comments, thanks. Also some linting mistakes I need to fix. I'll make another pass and comment when this is ready to be looked at.

@jbrockmendel
Copy link
Member Author

Updated with (most) requested edits, basic tests for TimedeltaArray (and fixes to make them pass, particularly implementation of is_monotonic_increasing etc)

@jbrockmendel
Copy link
Member Author

Pushed commits fixing tests, also implemented maybe_validate_freq. I think this can be handled more cleverly in some cases, will look into this. There was also an Issue about removing the verify_integrity argument from some of the constructors. Now would be decent time to do so.

assert isinstance(subarr, np.ndarray), type(subarr)
assert subarr.dtype == 'M8[ns]', subarr.dtype

subarr = cls._simple_new(subarr, name=name, freq=freq, tz=tz)
if dtype is not None:
if not is_dtype_equal(subarr.dtype, dtype):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is made unnecessary by the tz = dtl.validate_tz_from_dtype(dtype, tz) above (implemented a couple PRs ago)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this isn't quite right. Will revert/fix.

@TomAugspurger
Copy link
Contributor

@TomAugspurger can you confirm that the _from_sequences implemented here handle the right cases? It wasn't obvious if object-dtyped array/indexes were supposed to be handled there, but combining those cases was too clean to overlook.

It should handle any sequence where the scalar types are instances of ExtensionArray.dtype.type or NA. So yes, object-dtype arrays should be handled.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel could you summarize where this is in the overall TDA/DTA refactor, and how it gets us closer to the goal (primarily disentangling inheritance -> composition? Anything else?)

pandas/core/arrays/datetimelike.py Show resolved Hide resolved
pandas/core/arrays/datetimelike.py Show resolved Hide resolved
def __new__(cls, values, freq=None, tz=None, dtype=None, copy=False):
if isinstance(values, (list, tuple)) or is_object_dtype(values):
values = cls._from_sequence(values, copy=copy)
# TODO: Can we set copy=False here to avoid re-coping?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, then yes you're OK setting copy=False here. By definition, the conversion to datetime64[ns] will involve a copy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further question: it is not (yet) possible to simply remove this case? (eventually we should not call the DatetimeArray constructor with an array-like of scalars)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not (yet) possible to simply remove this case?

Not if we want to share the extant arithmetic tests (which we do)

(eventually we should not call the DatetimeArray constructor with an array-like of scalars)

I don't share this opinion, would prefer to delay this discussion until it is absolutely necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't share this opinion,

Then please raise this in the appropriate issue, as we have been discussing this before (I think it is #23212, although there is probably some more scattered discussion on other related PRs)

would prefer to delay this discussion until it is absolutely necessary.

It is here that you are redesigning the constructors for the array refactor, IIUC, so if there is a time we should discuss it, it is now I think?

Not if we want to share the extant arithmetic tests (which we do)

Can you clarify this a little bit? At what point do the arithmetic tests need to deal with array of objects?
Eg for boxing the constructed values into Series/Index/Array, there a properly dtyped array can be used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify this a little bit? At what point do the arithmetic tests need to deal with array of objects?

The pertinent word here is "extant". Many of the tests in tests/arithmetic pass a list into tm.box_expected or klass.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring the tests for a moment, I thought we were all on board with the goal of the DatetimelikeArray.__init__ being no inference and no copy.

Back to the tests, it looks like you can you add an entry to box_expected for DatetimeArray to return expected = DatetimeArray._from_sequence(expected)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring the tests for a moment, I thought we were all on board with the goal of the DatetimelikeArray.init being no inference and no copy.

My comment to Joris below about mothballing this conversation applies. But short answer is no: I did not get on board with that.

pandas/core/arrays/datetimes.py Show resolved Hide resolved
# TODO: Try to do this in just one place
tz = values.dt.tz
values = np.array(values.view('i8'))
elif isinstance(values, DatetimeArrayMixin):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatetimeArrayMixin -> cls?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you don't need to get the tz in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatetimeArrayMixin -> cls?

No. For the moment we are still using inheritance, so this would mess up for DatetimeIndex == DatetimeArray. When we change to composition this check will have to become isinstance(values, (DatetimeArray, ABCDatetimeIndex))


# NB: Among other things not yet ported from the DatetimeIndex
# constructor, this does not call _deepcopy_if_needed
return result

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
# list, tuple, or object-dtype ndarray/Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through to_datetime and _convert_listlike_datetimes, I don't see a conversion to ndarray[object].


# NB: Among other things not yet ported from the DatetimeIndex
# constructor, this does not call _deepcopy_if_needed
return result

@classmethod
def _from_sequence(cls, scalars, dtype=None, copy=False):
# list, tuple, or object-dtype ndarray/Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the root problem (referenced in your "circularity" comment, and down below in TimedeltaIndex.__new__) that to_datetime / to_timedelta returns an Index instead of an EA?

Could we have the public to_datetime just be a simple

array = _to_datetime(...)
return DatetimeIndex(array)

so the internal _to_datetime returns the array?

@@ -180,6 +203,23 @@ def _generate_range(cls, start, end, periods, freq, closed=None):

return cls._simple_new(index, freq=freq)

# ----------------------------------------------------------------
# Array-Like Methods
# NB: these are appreciably less efficient than the TimedeltaIndex versions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of (lack of) caching? This comment makes it seems like it's slower in general, when (if it's caching) it's just slower on repeated use).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW (as mentioned elsewhere), I am not sure we should add them as public methods. If we do so, we should add them to all our EAs, or actually even to the EA interface, and not only to TimedeltaArray (or datetimelike arrays).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do so, we should add them to all our EAs, or actually even to the EA interface

I'm not necessarily opposed to this, but this isn't obvious to me.

Because of (lack of) caching? This comment makes it seems like it's slower in general, when (if it's caching) it's just slower on repeated use).

Because the Index version defines monotonic_increasing, monotonic_decreasing, and is_unique in a single call via _engine.

_freq = None
_comparables = ['name', 'freqstr', 'tz']
_attributes = ['name', 'freq', 'tz']
timetuple = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, what is this? Is it intended to be public? It's not present in the public API for 0.23.4

if not isinstance(data, (np.ndarray, Index, ABCSeries,
DatetimeArrayMixin)):
if is_scalar(data):
raise ValueError('DatetimeIndex() must be called with a '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is kinda an API change (raising a TypeError instead of a ValueError).

Seems fine from a consistency P.O.V., but deserves a release note in the API breaking changes section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently all public cases raise ValueError, so keeping it on that would not give an API change?
(I agree that TypeError is slightly more appropriate though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would have been to keep it a ValueError and change to a TypeError in a separate PR, but here we are... will add a note in Breaking Changes.

pandas/core/arrays/datetimes.py Show resolved Hide resolved
tz = values.tz

# TODO: what about if freq == 'infer'?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we should also get the freq from the values as a "cheap" inference? Or would there be cases were an inferred frequency can be different than the actual frequency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we should also get the freq from the values as a "cheap" inference?

That's what I'm thinking, yah

def __new__(cls, values, freq=None, tz=None, dtype=None, copy=False):
if isinstance(values, (list, tuple)) or is_object_dtype(values):
values = cls._from_sequence(values, copy=copy)
# TODO: Can we set copy=False here to avoid re-coping?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further question: it is not (yet) possible to simply remove this case? (eventually we should not call the DatetimeArray constructor with an array-like of scalars)

if isinstance(values, DatetimeArrayMixin):
if lib.is_scalar(values):
raise TypeError(dtl.scalar_data_error(values, cls))
elif isinstance(values, ABCSeries):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would get out the _values, and then treat that the same as directly passing a DatetimeIndex/DatetimeArray ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if there is a graceful way to do this in the next pass (if I ever manage to catch up with all these comments!)

# TODO: Try to do this in just one place
tz = values.dt.tz
values = np.array(values.view('i8'))
elif isinstance(values, DatetimeArrayMixin):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you don't need to get the tz in this case?

if not isinstance(data, (np.ndarray, Index, ABCSeries,
DatetimeArrayMixin)):
if is_scalar(data):
raise ValueError('DatetimeIndex() must be called with a '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently all public cases raise ValueError, so keeping it on that would not give an API change?
(I agree that TypeError is slightly more appropriate though)

@@ -199,10 +198,11 @@ def _join_i8_wrapper(joinf, **kwargs):

_engine_type = libindex.DatetimeEngine

tz = None
_tz = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed? (just curious to understand)

data = data.astype(_TD_DTYPE)
else:
data = ensure_int64(data).view(_TD_DTYPE)
arr = TimedeltaArrayMixin(data, freq=freq)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should replace the handling of object dtype with TimedeltaArray constructor (this should not be able to handle object dtype eventually), but if needed that is fine to leave for a later PR that actually does the split / cleans up the TimedeltaArray constructor

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on board with the "leave for later" part of this

with pytest.raises(TypeError):
pd.DatetimeIndex(pd.Timestamp.now())

def test_from_sequence_requires_1dim(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a test in the EA base tests.

Although, thinking about it now, I am not sure we should require implementors to handle this case, as the method should never be called with 2D data to begin with.

What is the reason you added handling of this to that method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason you added handling of this to that method?

I'm not sure I understand the question. Is there a reason not to?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there ever a case where call _from_sequence internally with 2D data? And if so, what case?

I would think that at the point we internally call _from_sequence, we are sure it is a 1D array (eg as the result of some operation).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're far more competent than I am and can be counted on not to make this particular mistake; I cannot.

If we're not allowing object-dtype/lists in __init__/__new__ then we are basically forcing users to use the (private!) _from_sequence; validation need to happen somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _from_sequence method has a specific purpose in the EA interface: converting an array-like of scalars back to a proper Array (and should never be called by a user). Eg this method is used when doing an astype(EAdtype), but I was thinking that it might well be that in those cases that we call it, it already has been checked the data is 1D.

So I am simply honestly wondering if there is a specific case that you encountered now where this check was needed.

But I suppose the answer is that you are using _from_sequence to validate generic user input, and there the user can pass 2D data and we should invalidate it.

However, it is not only you that do it, we actually already do it when doing Series(..., dtype=EAdtype) (where also user input is processed in _from_sequence), so indeed, we should probably have this check in general.

So long story short: it indeed might make sense to add this type check. But then to repeat my initial comment: this should then be a test in the base EA tests to ensure all EAs do this properly?

dti = pd.date_range('2016-01-1', freq='MS', periods=9, tz=tz)

# Fails because np.array(dti, dtype=object) incorrectly returns Longs
result = DatetimeArray(np.array(dti, dtype=object), freq='infer')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mentioned above, but I don't think DatetimeArray constructor should handle this case?

That is what we discussed in #23212 (although, @jbrockmendel , you didn't really react there)

Copy link
Member Author

@jbrockmendel jbrockmendel Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully I've answered this enough times in this thread. I see no reason not to handle object dtype in the DatetimeArray constructor. Series and Index and DataFrame all handle object-dtype and lists; I find it counter-intuitive that we would have a small subset of classes that don't.

But in the short-run, we need to handle these cases if we want to share the extant tests (without significant overhaul) (which we do)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #23493 (comment) for my answer to a previous related comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere in this thread you've said you're fine with bikeshedding this after we have a working implementation. Has this changed in the last few minutes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you are referring to #23493 (comment) ? Indeed, I was somewhat inconsistent here, but I think the other comments are quite clear in that I am at least asking why you are adding or keeping handling object dtypes?

If the answer is: because I think that is the way it should be, then let's discuss that and try to get to a consensus (#23212). If the answer is: because of practical reasons for now, then let's discuss the practical reasons and see if that is OK to defer to a follow-up, or if there is an easy way to overcome the practical reason.

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche Your attention has been specifically requested in #23514, whereas I am finding it increasingly frustrating. I propose we step back from this conversation so I can spend some time addressing the subset of comments on which there is consensus.

@TomAugspurger
Copy link
Contributor

Maybe we can spend a bit of time building consensus on a direction forward? I'll try to build my own thoughts here on a proposal, as a response / concurrence to
#23185 (comment).

FWIW, everything on the TimedeltaArray / DatetimeArray is on my critical path, so I'm going to prioritize reviewing your PRs over everything else.

@jorisvandenbossche
Copy link
Member

[Tom] I thought we were all on board with the goal of the DatetimelikeArray.init being no inference and no copy.

@jbrockmendel you have to see my comments in the same light. I was assuming we had this discussion and agreed on the design of the constructors, so I was wondering why we couldn't already follow that decision in this PR instead of deferring to a follow-up.
So this mis-assumption might have made things more frustrating than needed for both of us.

Your attention has been specifically requested in #23514, whereas I am finding it increasingly frustrating.

Just as for Tom, this PR (and all issues related to the EA refactor) is high on my priority list, so I will still put time on this topic (which is more than writing comments here).
So you can expect a lot more comments to come (not necessarily now, but when you had the time to update the PR). But please don't see this as a bad sign, but rather as an opportunity to move forward quickly (and as an indicator of the importance of this refactor). Tom's PRs on the SparseArray and PeriodArray also saw a huge amount of comments and discussion, but I think that significantly improved the PRs and moved us relatively quickly to a shared understanding.
And to be clear, it is no problem that it takes some time to process those comments.

If you have specific feedback on how I can make my comments less frustrating apart from the above, I am honestly all ears.


But indeed, let's first try to build consensus on the fundamental design questions. I would propose to not do that here on this PR but on the general issue #23185 and the constructor split-off issue #23212, so we can keep the discussions in this PR on technical implementation details of things we in general already agree on.

I would personally propose to have a chat about this, as a more effective way to discuss things. But @jbrockmendel, I think that only makes sense if you can fix the audio issues we were having last time. If that is not possible right now, I would propose to do it on chat at a given time to at least have it a bit more interactive than on github.

@jbrockmendel
Copy link
Member Author

If you have specific feedback on how I can make my comments less frustrating apart from the above, I am honestly all ears.

I appreciate the consideration. At this point I think the short-term solution is for me to take a half-step back and disengage. The request that you do the same was primarily so you don't feel that I'm becoming non-responsive.

Medium-run, I very much think the priority should be getting enough of this in place that we can get the tests working. Later if you want to implement datetime_array mirroring period_array or rename __init__ to _from_sequence (I'll be renaming _from_sequence to _from_objects or something) then that will be entirely trivial to do.

I'll be going AFK (at least on this thread) for a few hours.

@jbrockmendel
Copy link
Member Author

Good news, for a certain value of "good". In the next pass of de-duplication I found some subtle bugs in the TimedeltaIndex construction, including some that we are specifically testing. So there will be at least one fixing-things PR before this comes back to the forefront.

In the interim, we can make incremental progress on orthogonal DTA/TDA code in #23415 and #23502.

@jorisvandenbossche
Copy link
Member

So there will be at least one fixing-things PR before this comes back to the forefront.

I know I am not at all in the position to force things, so I will just give my opinion (and take it as that :-)): I think we should rather focus our effort on moving forward this PR (eg start trying to reach consensus on the discussion points above).
It's really good you found those bugs, and we should certainly fix them, but those are not critical for the refactor, while the things discussed here are.

In the interim, we can make incremental progress on orthogonal DTA/TDA code in #23415 and #23502.

I thought you mentioned #23415 was kind of dormant for now? On #23502 I added some additional comments.

@jbrockmendel
Copy link
Member Author

I thought you mentioned #23415 was kind of dormant for now?

You thought correctly, but it may be worthwhile to un-dormant-ize it so as to maintain forward momentum while bugs get sorted out.

but those are not critical for the refactor, while the things discussed here are.

Disagree on both counts. Without the bugs being fixed, we can't institute meaningful tests for the array classes. On the other hand, whether we rename __init__/_simple_new to _from_sequence/__init__ can absolutely be postponed until after we have a fully-working implementation.

@jbrockmendel
Copy link
Member Author

Closing. I’ll salvage any useful tests in the PR that comes after #23539.

@jbrockmendel jbrockmendel deleted the pre-constructors branch April 5, 2020 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants